This is a quick notebook demonstrating some of the things you can do with PDL's companies dataset. We'll use a subset of the dataset containing just the companies based in the US (to limit resource usage and keep this notebook somewhat interactive).
First, let's get our environment set up and load the necessary modules and data.
# Imports
import json
import pandas as pd
import numpy as np
import folium
import matplotlib.pyplot as plt
import plotly
import plotly.graph_objects as go
import plotly.express as px
# Dataset path (subset of the full dataset containing the first ~1.8 million US-based companies)
dataset_path = "2mm_companies_us.csv"
# Read dataset (note: this could take ~30 sec to load)
# We'll just use the first 100K companies to improve responsiveness
# (on_bad_lines='skip' replaces the deprecated error_bad_lines=False in pandas >= 1.3)
df = pd.read_csv(dataset_path, on_bad_lines='skip')[:100000]
df
| | name | domain | year founded | industry | size range | locality | country | linkedin url | current employee estimate | total employee estimate |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cardinal strategies, llc | cardinal-strategies.com | 2004.0 | civil engineering | 11-50 | allen, texas, united states | united states | linkedin.com/company/cardinal-strategies-pllc | 6 | 11 |
| 1 | happy harrys discount drugs | NaN | NaN | computer software | 1-10 | pennsville, new jersey, united states | united states | linkedin.com/company/happy-harrys-discount-drugs | 0 | 2 |
| 2 | arteaga construction inc | NaN | NaN | construction | 1-10 | milwaukee, wisconsin, united states | united states | linkedin.com/company/arteaga-construction-inc | 2 | 5 |
| 3 | grail research | grailresearch.com | 2006.0 | research | 201-500 | new york, new york, united states | united states | linkedin.com/company/grail-research | 180 | 788 |
| 4 | cambridge multicultural arts center, inc. | cmacusa.org | NaN | non-profit organization management | 1-10 | cambridge, massachusetts, united states | united states | linkedin.com/company/cambridge-multicultural-a... | 1 | 11 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 99995 | our health club & spa | ourhealthclubandspa.com | NaN | hospital & health care | 1-10 | glen carbon, illinois, united states | united states | linkedin.com/company/our-health-club-&-spa | 1 | 5 |
| 99996 | chartwell dining services | NaN | NaN | food production | 1-10 | albany, new york, united states | united states | linkedin.com/company/chartwell-dining-services | 5 | 11 |
| 99997 | jc commercial realty | jccrc.com | 2010.0 | real estate | 1-10 | fort lauderdale, florida, united states | united states | linkedin.com/company/jccommercialrealty | 1 | 1 |
| 99998 | brown bus co | brownbuscompany.com | 1988.0 | transportation/trucking/railroad | 11-50 | caldwell, idaho, united states | united states | linkedin.com/company/brown-bus-co | 7 | 8 |
| 99999 | charleston light opera guild | charlestonlightoperaguild.org | NaN | entertainment | 1-10 | charleston, west virginia, united states | united states | linkedin.com/company/charleston-light-opera-guild | 1 | 2 |
100000 rows × 10 columns
Next, let's take a quick look at the dataset to better understand the data it contains.
fig = px.histogram(df, x='industry', title="Number of Companies in Each Industry")
fig.update_xaxes(title_text = "")
fig.update_yaxes(title_text = "Count")
fig.show()
The histogram makes it easy to spot the industries most heavily represented in this dataset.
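To read exact numbers off the histogram, a `value_counts` on the `industry` column works; below is a minimal sketch on a toy frame (swap in the real `df` to get the actual ranking):

```python
import pandas as pd

# Toy stand-in for the companies dataframe loaded above
toy = pd.DataFrame({'industry': ['construction', 'internet',
                                 'construction', 'real estate']})

# Count companies per industry, most common first
top_industries = toy['industry'].value_counts()
print(top_industries.head(10))
```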
fig = px.histogram(df, x='year founded', range_x=(1800, 2020), nbins=220,
                   title='Number of Companies Founded Each Year')
fig.update_xaxes(title_text = "Year Founded")
fig.update_yaxes(title_text = "Count")
fig.show()
It's interesting to note that the majority of companies in the dataset were founded between 2010 and 2015; after that, there is noticeably less data for more recent companies.
us_df = df[ df['locality'].notnull() ] # remove missing localities
us_df = us_df[ us_df['locality'].str.contains('united states') ]
us_counts = us_df['locality'].value_counts()
fig = plotly.subplots.make_subplots(rows=1, cols=2)
fig.add_trace(
    go.Scatter(x=np.arange(us_counts.shape[0]), y=us_counts,
               hovertext=us_counts.index, mode='markers', name="Count"),
    row=1, col=1
)
fig.add_trace(
    go.Scatter(x=np.arange(us_counts.shape[0]), y=np.cumsum(us_counts)/us_counts.sum(),
               hovertext=us_counts.index, mode='markers', name='Cumulative Density'),
    row=1, col=2
)
fig.update_layout(title_text="Number of Companies in Each US City")
fig.update_layout(hovermode='closest')
fig.update_xaxes(title_text = "Locality Index")
fig.update_yaxes(title_text = "Count", row=1, col=1)
fig.update_yaxes(title_text = "Cumulative Density", row=1, col=2)
fig.show()
print("Total Number of US Companies: {}".format(us_counts.sum()))
Total Number of US Companies: 98740
This figure shows the power-law distribution of companies across US cities: most companies are concentrated in relatively few cities.
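One way to quantify that concentration is to count how many cities account for half of all companies. Here's a sketch on toy counts (on the real data you'd pass in `us_counts`):

```python
import numpy as np
import pandas as pd

# Toy locality counts in descending order (stand-in for us_counts)
counts = pd.Series([500, 200, 100, 50, 25, 10, 5],
                   index=[f"city_{i}" for i in range(7)])

# Cumulative fraction of companies covered by the top-k cities
cum_frac = np.cumsum(counts.values) / counts.sum()

# Smallest k such that the top-k cities hold at least half of all companies
cities_for_half = int(np.searchsorted(cum_frac, 0.5) + 1)
print(cities_for_half)
```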
While we're at it, let's also see what this looks like geographically.
# Read latlon lookup table
with open('locality_to_latlon.json', 'r', encoding='utf-8') as f:
    locality_to_latlon = json.load(f)
# The ranges for number of companies in a city
# (each city will fall into one of these ranges)
sizes = [(0, 500), (500, 1000), (1000, 5000), (5000, 10000), (10000, 100000)]
scale = 1
# Create a dataframe with all the relevant info for plotting
df_plot = pd.DataFrame(columns=['locality', 'lat', 'lon', 'size', 'sizeClass'])
df_plot['locality'] = us_counts.index
df_plot['size'] = us_counts.values
df_plot['sizeClass'] = np.digitize(df_plot['size'], bins=[size[0] for size in sizes])
def lookupLatLon(row):
    locality = row['locality']
    if locality in locality_to_latlon:
        lat, lon = locality_to_latlon[locality]
        return (lat, lon)
    return (0, 0)

df_plot['lat'] = df_plot.apply(lambda row: lookupLatLon(row)[0], axis=1)
df_plot['lon'] = df_plot.apply(lambda row: lookupLatLon(row)[1], axis=1)
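`np.digitize` above assigns each city count a 1-based bucket index, using the lower bounds of `sizes` as bin edges. A minimal check of that behavior:

```python
import numpy as np

sizes = [(0, 500), (500, 1000), (1000, 5000), (5000, 10000), (10000, 100000)]
bins = [lo for lo, hi in sizes]  # [0, 500, 1000, 5000, 10000]

# Counts of 3, 750, and 12000 land in buckets 1, 2, and 5 respectively
classes = np.digitize([3, 750, 12000], bins=bins)
print(classes.tolist())
```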
# Plot our map
fig = go.Figure()
colors = ["royalblue","lightseagreen","orange","purple","crimson"]
for ii in range(len(sizes))[::-1]:  # Plot largest sizes first
    df_sub = df_plot[df_plot['sizeClass'] == ii + 1]
    fig.add_trace(go.Scattergeo(
        locationmode='USA-states',
        lon=df_sub['lon'],
        lat=df_sub['lat'],
        text=df_sub[['locality', 'size']],
        marker=dict(
            size=df_sub['size'] / scale,
            color=colors[ii],
            line_color='rgb(40,40,40)',
            line_width=0.5,
            sizemode='area'
        ),
        name='{0} - {1}'.format(sizes[ii][0], sizes[ii][1]))
    )
fig.update_layout(
    title_text='Number of Companies in US Cities',
    showlegend=True,
    width=1000, height=1000,
    geo=dict(
        scope='usa',
    )
)
fig.show()
Now as our first real task, let's find the 10 largest companies in the US and plot them on a map.
# Map the 10 Largest Companies in the US
# Sort companies by size
us_companies = df[ df['country'] == 'united states']
company_sizes = us_companies['total employee estimate'].values
sorted_idxs = np.argsort(company_sizes)[::-1] # descending order
sorted_idxs = sorted_idxs[:10] # top 10
# Read latlon lookup table
with open('locality_to_latlon.json', 'r', encoding='utf-8') as f:
    locality_to_latlon = json.load(f)
# Plot these companies on a map
m = folium.Map(location=[39.8282, -98.5795], zoom_start=4)
count = 1
print("The 10 Largest US Companies are:")
for idx, company in us_companies.iloc[sorted_idxs].iterrows():
    name = company['name']
    locality = company['locality']
    size = int(company['total employee estimate'])
    if locality in locality_to_latlon:
        latlon = locality_to_latlon[locality]
        latlon = [float(x) for x in latlon]
        folium.Circle(location=latlon,
                      popup="{}. {} - {} [{} employees]".format(count, name, locality, size),
                      radius=size,
                      fill=True,
                      ).add_to(m)
    else:
        print("Could not find latlon for {}".format(name))
    print("  {}. {} - {} [{} employees]".format(count, name, locality, size))
    count += 1
# Show map:
m
The 10 Largest US Companies are:
  1. oracle - redwood shores, california, united states [206211 employees]
  2. kaiser permanente - oakland, california, united states [105240 employees]
  3. convergys - cincinnati, ohio, united states [74278 employees]
  4. jll - chicago, illinois, united states [59098 employees]
  5. barclays investment bank - new york, new york, united states [45981 employees]
  6. uber - san francisco, california, united states [36867 employees]
  7. amdocs - chesterfield, missouri, united states [33663 employees]
  8. cleveland clinic - cleveland, ohio, united states [32174 employees]
  9. us department of homeland security - washington, district of columbia, united states [30571 employees]
  10. fresenius medical care - waltham, massachusetts, united states [26649 employees]
Next, let's see if we can find any relationship between the age of a company and its size.
# Plot company size as a function of its age
# A little data cleaning to remove missing/null values
clean_df = df[ df['year founded'].notnull() ]
clean_df = clean_df[ clean_df['total employee estimate'].notnull() ]
# Plot
fig = px.scatter(clean_df, x='year founded', y='total employee estimate',
                 hover_name='name',
                 title="Company Age vs Size",
                 labels={"year founded": "Year Founded",
                         "total employee estimate": "Total Employee Estimate"})
fig.update_xaxes(range=[1800, 2020])
fig.show()
Perhaps unsurprisingly, there doesn't seem to be any correlation between a company's size and the year in which it was founded. What stands out most are the outliers: unexpectedly large companies appear across all founding years.
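That visual impression can be checked numerically with a Pearson correlation; on the real data this would be `clean_df['year founded'].corr(clean_df['total employee estimate'])`. A self-contained sketch on toy values:

```python
import pandas as pd

# Toy stand-in for clean_df
toy = pd.DataFrame({'year founded': [1900, 1950, 2000, 2010],
                    'total employee estimate': [50, 5000, 10, 200]})

# Pearson correlation between founding year and company size
r = toy['year founded'].corr(toy['total employee estimate'])
print(round(r, 3))
```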
Now, let's look at how industries vary across different cities. We know from a previous figure that Chicago, San Francisco, and Houston have a similar number of companies, so let's see how their companies are distributed across industries.
sf_companies = df [ df['locality'] == 'san francisco, california, united states']
chi_companies = df[ df['locality'] == 'chicago, illinois, united states' ]
hou_companies = df[ df['locality'] == 'houston, texas, united states' ]
fig = go.Figure()
fig.add_trace(go.Histogram(x=hou_companies['industry'], name='HOU'))
fig.add_trace(go.Histogram(x=sf_companies['industry'], name='SF'))
fig.add_trace(go.Histogram(x=chi_companies['industry'], name='CHI'))
# Overlay the histograms
fig.update_layout(barmode='overlay')
# Reduce opacity so all three are visible
fig.update_traces(opacity=0.75)
fig.show()
From this figure, Houston stands out for its oil & energy and construction industries, while Chicago has more companies in marketing & advertising and financial services. Unsurprisingly, San Francisco dominates in internet, computer software, and information technology.
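Since the three cities don't have identical totals, comparing industry *shares* can be fairer than raw counts. A sketch using `pd.crosstab` with row normalization (toy data here; the real call would use the `locality` and `industry` columns of `df`):

```python
import pandas as pd

# Toy locality/industry pairs standing in for the real dataframe
toy = pd.DataFrame({
    'locality': ['sf', 'sf', 'hou', 'hou', 'chi'],
    'industry': ['internet', 'internet', 'oil & energy',
                 'construction', 'financial services'],
})

# Share of each industry within each city (each row sums to 1)
shares = pd.crosstab(toy['locality'], toy['industry'], normalize='index')
print(shares.loc['sf', 'internet'])
```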
As you can see, there are a lot of interesting findings and useful insights that can be found by looking into this dataset. Hopefully, this gives you some ideas!